A practical audio intelligence system for generating accurate, human-readable descriptions of audio clips — suitable for building searchable audio catalogs.
This system takes audio files as input and produces:
- A concise one-sentence description of what is heard
- A detailed paragraph covering temporal structure and acoustic character
- A structured tag list (controlled vocabulary)
- An ordered list of sound events (temporal breakdown)
- A confidence score for the overall description
Everything is designed for cataloging accuracy — no emotion analysis, no cinematic interpretation, no hallucinated context.
Example output:
```json
{
  "file_name": "metal_impact_01.wav",
  "short_description": "A sharp metallic impact followed by a brief echo.",
  "detailed_description": "The clip contains a metallic impact. The temporal sequence is: metallic impact → short echo tail. The sound has a sharp, transient attack.",
  "tags": ["impact", "metallic impact", "percussive", "sharp transient", "reverb"],
  "sound_events": ["metallic impact", "short echo tail"],
  "confidence": 0.84
}
```

```text
┌─────────────────────────────────────────────────────────┐
│                Audio Analyzer — Phase 1                 │
│                                                         │
│  Audio File (.wav / .mp3 / .flac / .ogg / .m4a)         │
│        │                                                │
│        ▼                                                │
│  ┌──────────────┐                                       │
│  │ AudioLoader  │  librosa load + normalize to 48kHz    │
│  └──────┬───────┘                                       │
│         │                                               │
│         ├──────────────────────────┐                    │
│         ▼                          ▼                    │
│  ┌──────────────┐        ┌─────────────────┐            │
│  │FeatureExtract│        │   CLAPTagger    │            │
│  │  (librosa)   │        │ (laion/larger_  │            │
│  │              │        │  clap_general)  │            │
│  │ - RMS energy │        │                 │            │
│  │ - Spectral   │        │   Full-clip     │            │
│  │ - Transients │        │   zero-shot     │            │
│  │ - Band energy│        │ classification  │            │
│  │ - Silence    │        │                 │            │
│  └──────┬───────┘        │ Sliding window  │            │
│         │                │ event detection │            │
│         │                └────────┬────────┘            │
│         │                         │                     │
│         └──────────┬──────────────┘                     │
│                    ▼                                    │
│             ┌──────────────┐                            │
│             │ Description  │  Template engine:          │
│             │ Synthesizer  │  tags + events + features  │
│             │              │  → natural language        │
│             └──────┬───────┘                            │
│                    │                                    │
│                    ▼                                    │
│             ┌──────────────┐                            │
│             │  Serializer  │  JSON / Markdown / CSV     │
│             └──────┬───────┘                            │
│                    ▼                                    │
│           AudioAnalysisRecord                           │
│           (Pydantic validated)                          │
└─────────────────────────────────────────────────────────┘
```
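The record at the bottom of the diagram is Pydantic-validated in the real pipeline (`src/timbre/output/schema.py`). The sketch below is a standard-library dataclass stand-in, not the actual model: field names are taken from the example output above, and the confidence bound that Pydantic would enforce is written by hand.

```python
from dataclasses import dataclass, field

# Stand-in for the Pydantic AudioAnalysisRecord; the real model adds
# stricter validation (and, with --full, metadata + acoustics sections).
@dataclass
class AudioAnalysisRecord:
    file_name: str
    short_description: str
    detailed_description: str
    tags: list[str] = field(default_factory=list)
    sound_events: list[str] = field(default_factory=list)
    confidence: float = 0.0

    def __post_init__(self) -> None:
        # Pydantic would use a constrained float; sketched by hand here
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")

record = AudioAnalysisRecord(
    file_name="metal_impact_01.wav",
    short_description="A sharp metallic impact followed by a brief echo.",
    detailed_description="The clip contains a metallic impact.",
    tags=["impact", "metallic impact"],
    sound_events=["metallic impact", "short echo tail"],
    confidence=0.84,
)
```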
| Property | CLAP + Templates | Audio LLM (e.g. Qwen-Audio) |
|---|---|---|
| Hallucination risk | None (labels are fixed) | Present |
| Consistency | Deterministic per run | Variable |
| Speed | Fast (< 2s/clip on GPU) | Slow (5–20s/clip) |
| GPU memory | ~4 GB | 14–40 GB |
| Catalog vocabulary | Controlled | Free-form |
| Phase 1 suitability | ✓ Ideal | Phase 2 enhancement |
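The "no hallucination" property in the table follows directly from how the template approach works: only controlled-vocabulary labels can ever appear in the generated text. A minimal sketch (hypothetical function name; the real logic lives in the DescriptionSynthesizer):

```python
def synthesize_description(primary_label: str, events: list[str]) -> str:
    """Slot fixed labels into fixed sentence templates; the output
    vocabulary is closed, so nothing can be invented."""
    sentence = f"The clip contains a {primary_label}."
    if len(events) > 1:
        sequence = " → ".join(events)
        sentence += f" The temporal sequence is: {sequence}."
    return sentence

print(synthesize_description("metallic impact", ["metallic impact", "short echo tail"]))
# → The clip contains a metallic impact. The temporal sequence is: metallic impact → short echo tail.
```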
```text
audio_analyzer/
├── timbre.py                # Root CLI entrypoint
├── analyze.py               # Compatibility wrapper for single-file CLI
├── batch_process.py         # Compatibility wrapper for batch CLI
├── pyproject.toml           # Poetry dependency metadata
├── requirements.txt
├── setup_mac.sh             # macOS Silicon setup (M1/M2/M3/M4)
├── setup_runpod.sh          # RunPod GPU environment setup
│
├── config/
│   ├── config.yaml          # Model, analysis, output settings
│   └── vocabulary.yaml      # Controlled vocabulary (13 categories, ~194 labels)
│
├── src/
│   ├── cli/
│   │   ├── main.py          # Top-level Click CLI with subcommands
│   │   ├── analyze.py       # Single-file analysis command
│   │   ├── batch.py         # Batch analysis command
│   │   └── cache.py         # Label-cache builder command
│   └── timbre/
│       ├── config_loader.py # YAML config loader + logging setup
│       ├── pipeline.py      # Main orchestrator (AudioAnalysisPipeline)
│       ├── ingestion/
│       │   └── audio_loader.py          # Load + validate + normalize audio files
│       ├── models/
│       │   └── clap_tagger.py           # CLAP zero-shot classification wrapper
│       ├── analysis/
│       │   ├── feature_extractor.py     # Acoustic features (librosa)
│       │   ├── event_detector.py        # Sliding-window event detection
│       │   └── description_synthesizer.py # Natural language description builder
│       └── output/
│           ├── schema.py                # Pydantic AudioAnalysisRecord model
│           ├── serializer.py            # JSON / Markdown / CSV per-file output
│           └── catalog_builder.py       # Multi-file catalog aggregation
│
└── outputs/                 # Default output location
    ├── json/                # Per-file JSON
    ├── markdown/            # Per-file Markdown review reports
    ├── catalog.md           # Full catalog grouped by category
    ├── catalog.csv          # Flat CSV catalog
    └── batch_results.json   # All records in one JSON array
```
- macOS 12.3 (Monterey) or later — required for MPS support
- Apple Silicon Mac (M1 or newer)
- Homebrew
- Python 3.10+
On Apple Silicon, PyTorch uses MPS (Metal Performance Shaders) — the GPU backend for Apple's unified memory architecture. The system detects it automatically in order of priority:
CUDA (NVIDIA) → MPS (Apple Silicon) → CPU
fp16 is automatically disabled on MPS — CLAP runs in fp32, which is correct and stable. Expect ~3–8x slower than a dedicated NVIDIA GPU but much faster than CPU-only.
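The detection order above can be sketched as a small helper (hypothetical function name; the real logic lives in the pipeline/config loader):

```python
def pick_device() -> str:
    """Return the best available torch device, in priority order: cuda > mps > cpu."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)  # only present on Apple builds
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass  # torch not installed: fall back to CPU
    return "cpu"

device = pick_device()
use_fp16 = device == "cuda"  # fp16 stays off on MPS and CPU, per the note above
```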
```bash
cd audio_analyzer
bash setup_mac.sh
```

This creates a `.venv` in the project root, installs ffmpeg via Homebrew, PyTorch with MPS support, all Python dependencies, and pre-downloads the CLAP model (~1.2 GB).
`pyproject.toml` is the dependency source of truth. The checked-in `requirements.txt` is generated from Poetry during release with:

```bash
poetry export --format requirements.txt --without-hashes --only main
```
PyTorch installation is still platform-specific during setup:

- macOS: install `torch` separately before `poetry install`
- RunPod/CUDA: install `torch`, `torchaudio`, and `torchvision` together from the matching CUDA wheel index before `poetry install`
Example:

```bash
cd audio_analyzer
poetry install
poetry run timbre analyze samples/0_sample.wav
```

The Makefile also supports Poetry directly:

```bash
make install
make run USE_POETRY=1 FILE=samples/0_sample.wav
make batch USE_POETRY=1 DIR=./samples
```

Poetry console scripts are also defined:

```bash
poetry run timbre analyze samples/0_sample.wav
poetry run timbre batch ./samples
poetry run timbre vocab cache --force
```

Activate the virtual environment, then run:

```bash
source .venv/bin/activate
python timbre.py analyze samples/0_sample.wav
python timbre.py batch ./samples/
```

Or skip activation and use the venv Python directly:

```bash
.venv/bin/python timbre.py analyze samples/0_sample.wav
```

To inspect the configured profiles:

```bash
python timbre.py analyze --list-profiles
python timbre.py batch --list-profiles
python timbre.py profile list
python timbre.py profile inspect precise
```

To run with a specific profile:

```bash
python timbre.py analyze samples/0_sample.wav --profile precise
python timbre.py batch ./samples --profile fast
```

To run several profiles in one command:

```bash
python timbre.py analyze samples/0_sample.wav \
    --profile balanced \
    --profile precise \
    --profile conservative

python timbre.py batch ./samples \
    --profile fast \
    --profile precise
```

To sweep every named profile in the config:

```bash
python timbre.py analyze samples/0_sample.wav --all-profiles
python timbre.py batch ./samples --all-profiles
```

Outputs are scoped automatically by profile name. For example, with `--profile precise` and the default config, artifacts are written under `./out/precise/`.
To confirm MPS is active, look for this line in the output:

```text
[INFO] timbre.models.clap_tagger: Loading CLAP model: laion/larger_clap_general on mps
```
The project now supports a simple CPU-only Linux Docker image for distribution. The container exposes the existing `timbre` CLI directly, uses the bundled `config/config.yaml` and `config/vocabulary.yaml` by default, and downloads the CLAP model from Hugging Face on first run.
```bash
docker build -t timbre .
```

If you want to send the image directly instead of publishing it to a registry, export one tarball per CPU architecture:

```bash
make docker-export-arm64
make docker-export-amd64
```

Or build both at once:

```bash
make docker-export-all
```

This produces:

- `dist/timbre-arm64.tar` for Apple Silicon users
- `dist/timbre-amd64.tar` for Intel/AMD users
The easiest way to share it with non-technical users is to send:

- the right tar file for their machine
- `timbre-docker.sh`
They can then load and run the image with simple commands:

```bash
bash timbre-docker.sh load
bash timbre-docker.sh analyze /path/to/example.wav
bash timbre-docker.sh batch /path/to/folder
```

If they prefer using Docker directly, they can import the right file with:

```bash
docker load -i timbre-arm64.tar
```

or:

```bash
docker load -i timbre-amd64.tar
```

Mount input audio read-only and an output directory read-write:

```bash
bash timbre-docker.sh analyze /path/to/example.wav
```

The equivalent raw Docker command is:

```bash
docker run --rm \
  -v "$PWD/samples:/data/in:ro" \
  -v "$PWD/out:/data/out" \
  timbre analyze /data/in/example.wav --output-dir /data/out
```

For batch runs:

```bash
bash timbre-docker.sh batch /path/to/folder
```

The equivalent raw Docker command is:

```bash
docker run --rm \
  -v "$PWD/samples:/data/in:ro" \
  -v "$PWD/out:/data/out" \
  timbre batch /data/in --output-dir /data/out
```

To avoid downloading the CLAP model on every fresh container run, mount a persistent cache directory:

```bash
docker run --rm \
  -v "$PWD/samples:/data/in:ro" \
  -v "$PWD/out:/data/out" \
  -v "$PWD/.hf-cache:/root/.cache/huggingface" \
  timbre analyze /data/in/example.wav --output-dir /data/out
```

Mount your custom files and pass them through the existing CLI options:

```bash
docker run --rm \
  -v "$PWD/samples:/data/in:ro" \
  -v "$PWD/out:/data/out" \
  -v "$PWD/config:/data/config:ro" \
  timbre analyze /data/in/example.wav \
  --output-dir /data/out \
  --config /data/config/config.yaml \
  --profile precise \
  --vocab /data/config/vocabulary.yaml
```

- This first Docker workflow is CPU-only; no CUDA or GPU container support is included yet.
- The first run may take longer because model weights are downloaded at runtime.
- The recommended contract is to mount inputs read-only, outputs writable, and optionally persist `/root/.cache/huggingface`.
- Manual sharing is the simplest path right now: build locally, send the correct `dist/*.tar` file plus `timbre-docker.sh`, and let friends use the wrapper script instead of writing Docker commands by hand.
- RunPod pod with at least one GPU (A10G, RTX 3090, A100, etc.)
- Recommended template: RunPod PyTorch 2.4 (CUDA 12.4, Ubuntu 22.04)
- Your local machine needs `ssh` and `scp` (both standard on macOS/Linux)
Note: `torch >= 2.6.0` is required due to CVE-2025-32434. The setup script handles this upgrade automatically, including upgrading `torchvision` and `torchaudio` together to avoid version conflicts.
In the RunPod UI, start a pod and click Connect → SSH. You'll get connection details that look like:

```bash
ssh root@194.68.245.147 -p 22017 -i ~/.ssh/id_ed25519
```
From your local machine, in the directory containing `audio_analyzer/`:

```bash
scp -P 22017 -i ~/.ssh/id_ed25519 -r ./audio_analyzer root@194.68.245.147:~/
```

The `-P` flag (uppercase) sets the port for scp — note this differs from ssh, which uses lowercase `-p`.

To also upload audio samples:

```bash
scp -P 22017 -i ~/.ssh/id_ed25519 -r ./my_samples root@194.68.245.147:~/audio_analyzer/samples/
```

Then SSH in and run the setup script:

```bash
ssh root@194.68.245.147 -p 22017 -i ~/.ssh/id_ed25519
cd ~/audio_analyzer
bash setup_runpod.sh
```

The setup script will:

- Auto-detect your CUDA version and select the right PyTorch wheel
- Create a `.venv` virtual environment in the project root
- Install torch + torchaudio + torchvision together into the venv
- Install ffmpeg and libsndfile1
- Install all remaining Python dependencies from the release-generated `requirements.txt`
- Pre-download and cache the CLAP model (~1.2 GB)
- Verify everything works
Activate the virtual environment first:

```bash
source .venv/bin/activate
```

Single file:

```bash
python timbre.py analyze samples/0_sample.wav
```

Single file with all outputs:

```bash
python timbre.py analyze samples/0_sample.wav --output-dir ./outputs --markdown --full
```

Batch — entire folder:

```bash
python timbre.py batch ./samples/ --output-dir ./outputs
```

If you update the code locally and want to push changes to the pod, use scp again. Because scp always overwrites, it's safe to re-run:
```bash
# Re-upload only the src/ folder (faster than uploading everything)
scp -P 22017 -i ~/.ssh/id_ed25519 -r ./audio_analyzer/src root@194.68.245.147:~/audio_analyzer/

# Or re-upload specific files
scp -P 22017 -i ~/.ssh/id_ed25519 \
    ./audio_analyzer/src/timbre/models/clap_tagger.py \
    root@194.68.245.147:~/audio_analyzer/src/timbre/models/
```

Copy the outputs folder back to your local machine:
```bash
scp -P 22017 -i ~/.ssh/id_ed25519 -r \
    root@194.68.245.147:~/audio_analyzer/outputs \
    ./outputs_from_pod
```

Or just the catalog files:

```bash
scp -P 22017 -i ~/.ssh/id_ed25519 \
    root@194.68.245.147:~/audio_analyzer/outputs/catalog.md \
    root@194.68.245.147:~/audio_analyzer/outputs/catalog.csv \
    root@194.68.245.147:~/audio_analyzer/outputs/batch_results.json \
    ./outputs_from_pod/
```

Add an entry to `~/.ssh/config` so you don't have to type the full connection string every time:

```text
Host runpod-audio
    HostName 194.68.245.147
    Port 22017
    User root
    IdentityFile ~/.ssh/id_ed25519
```

Then you can use shorthand for everything:

```bash
ssh runpod-audio
scp -r ./audio_analyzer runpod-audio:~/
scp -r runpod-audio:~/audio_analyzer/outputs ./outputs_from_pod
```

```bash
python timbre.py analyze path/to/file.wav
```

Options:
| Flag | Description |
|---|---|
| `--output-dir` / `-o` | Directory to save output files |
| `--profile` | Named profile from config.yaml (repeatable) |
| `--all-profiles` | Run every named profile from config.yaml |
| `--list-profiles` | Print configured profiles and exit |
| `--markdown` | Also save a per-file Markdown review report |
| `--full` | Save full JSON (includes metadata + acoustics) |
| `--no-windowed` | Disable sliding-window event detection (faster) |
| `--quiet` / `-q` | Suppress console output |
```bash
python timbre.py batch ./samples/
```

Options:
| Flag | Description |
|---|---|
| `--output-dir` / `-o` | Root output directory |
| `--profile` | Named profile from config.yaml (repeatable) |
| `--all-profiles` | Run every named profile from config.yaml |
| `--list-profiles` | Print configured profiles and exit |
| `--catalog` | Generate catalog.md (default: on) |
| `--csv` | Generate catalog.csv (default: on) |
| `--markdown` | Save per-file Markdown reports |
| `--full` | Full JSON output per file |
| `--limit N` | Only process first N files (useful for testing) |
| `--no-windowed` | Disable sliding-window event detection |
Profiles let you A/B CLAP inference settings without editing code. The runtime config is selected from `config/config.yaml`, merged with the base settings, and stamped into every output record as provenance.
Common workflow:
# See the available profiles
python timbre.py analyze --list-profiles
python timbre.py profile list
# Run the **same** file with two profiles
python timbre.py analyze samples/0_sample.wav --profile balanced
python timbre.py analyze samples/0_sample.wav --profile precise
# Run several profiles in one pass
python timbre.py analyze samples/0_sample.wav \
--profile balanced \
--profile precise \
--profile sensitive
# Batch compare two profiles across a folder
python timbre.py batch ./samples --profile fast
python timbre.py batch ./samples --profile precise
# Sweep every configured profile
python timbre.py batch ./samples --all-profiles
# Inspect one profile in detail
python timbre.py profile inspect precise
python timbre.py profile inspect precise --jsonWith the default output settings, this produces separate directories such as:
```text
out/
  balanced/
  fast/
  precise/
```
Each JSON, Markdown, CSV, and catalog entry includes:

- `analysis_provenance.profile_name`
- `analysis_provenance.profile_fingerprint`
- `analysis_provenance.config_path`
CLI overrides still win over the profile. For example, `--profile precise --no-windowed` disables windowed analysis for that run and produces a different profile fingerprint in the output provenance.
When multiple requested profiles share the same model and label cache, the CLI reuses the already-loaded CLAP resources between runs so you do not pay the model load cost repeatedly.
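The merge-and-fingerprint behavior described above can be sketched as follows. The helper names are hypothetical (the real implementation lives in `config_loader.py` / `pipeline.py`); the key idea is that base settings, profile overrides, and CLI overrides are deep-merged in order, and the effective result is hashed for provenance.

```python
import hashlib
import json

def merge_settings(base: dict, *overlays: dict) -> dict:
    """Deep-merge dicts left to right: later overlays (profile, then CLI) win."""
    merged = dict(base)
    for overlay in overlays:
        for key, value in overlay.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_settings(merged[key], value)
            else:
                merged[key] = value
    return merged

def profile_fingerprint(settings: dict) -> str:
    """Stable short hash of the effective settings, for output provenance."""
    canonical = json.dumps(settings, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

base = {"analysis": {"hop_seconds": 0.5, "min_confidence": 0.25}}
precise = {"analysis": {"hop_seconds": 0.25, "min_confidence": 0.20}}
cli_override = {"analysis": {"use_windowed_analysis": False}}

effective = merge_settings(base, precise, cli_override)
# The CLI override changes the effective settings, so the fingerprint changes too
assert profile_fingerprint(effective) != profile_fingerprint(merge_settings(base, precise))
```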
The dedicated profile inspection command is useful when you want to review the human-friendly label, description, effective settings, and raw YAML overrides for a profile without opening `config.yaml` manually:

```bash
python timbre.py profile list
python timbre.py profile inspect balanced
python timbre.py profile inspect compact_model --json
```

Example JSON record:

```json
{
  "file_name": "footsteps_gravel.wav",
  "short_description": "Footsteps on gravel with a consistent rhythmic pace.",
  "detailed_description": "The clip contains footsteps on gravel. Secondary sounds include outdoor ambience. The sound has noticeable transient elements.",
  "tags": ["movement", "footsteps on gravel", "outdoor ambience", "percussive", "rhythmic"],
  "sound_events": ["footsteps on gravel", "outdoor ambience"],
  "confidence": 0.78
}
```

Example per-file Markdown report:

```markdown
## Impact

### `metal_hit_01.wav`

**A sharp metallic impact followed by a brief reverberant tail.**

| | |
|---|---|
| Duration | 1.23s |
| Label | metallic impact |
| Confidence | ████░ 0.84 |
| Events | metallic impact → short echo tail |
| Tags | `impact`, `metallic impact`, `percussive`, `sharp transient` |
```

Example `catalog.csv` rows:

```csv
file_name,duration_seconds,primary_category,primary_label,confidence,short_description,...
metal_hit_01.wav,1.23,impact,metallic impact,0.84,A sharp metallic impact...,...
footsteps_01.wav,4.50,movement,footsteps on gravel,0.78,Footsteps on gravel...,...
```
```yaml
default_profile: "balanced"

base:
  model:
    model_id: "laion/larger_clap_general"
    device: null
    fp16: true
    vocab_file: "vocabulary.yaml"
    label_cache_path: ".cache/label_cache.pt"
  analysis:
    use_windowed_analysis: true
    window_seconds: 2.0
    hop_seconds: 0.5
    min_confidence: 0.25
    top_k_categories: 5
  output:
    output_dir: "./out"
    save_per_file_markdown: true
    full_json: false

profiles:
  balanced:
    label: "Balanced"
    description: "Default balance between speed, temporal detail, and category breadth."
  fast:
    label: "Fast"
    description: "Higher throughput profile for broad folder sweeps and initial triage."
    analysis:
      hop_seconds: 1.0
      top_k_categories: 3
  precise:
    label: "Precise"
    description: "Finer temporal resolution and broader category search for detailed review."
    analysis:
      hop_seconds: 0.25
      min_confidence: 0.20
      top_k_categories: 7
```

Profile overrides currently target CLAP inference behavior and related pipeline settings, including:

- `model.model_id`
- `model.device`
- `model.fp16`
- `analysis.use_windowed_analysis`
- `analysis.windowed_min_duration`
- `analysis.window_seconds`
- `analysis.hop_seconds`
- `analysis.min_confidence`
- `analysis.top_k_categories`
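The `window_seconds` / `hop_seconds` settings drive the sliding-window event detection. A minimal sketch of how such windows might be laid out over a clip (hypothetical helper; the real logic lives in `event_detector.py`):

```python
def window_starts(duration_s: float, window_s: float = 2.0, hop_s: float = 0.5) -> list[float]:
    """Start times of analysis windows sliding across the clip."""
    if duration_s <= window_s:
        return [0.0]  # short clips get a single full-clip window
    starts = []
    t = 0.0
    while t + window_s <= duration_s:
        starts.append(round(t, 3))
        t += hop_s
    return starts

# A 4-second clip with 2 s windows and a 0.5 s hop:
print(window_starts(4.0))  # → [0.0, 0.5, 1.0, 1.5, 2.0]
```

Halving `hop_seconds` (as the `precise` profile does) roughly doubles the number of CLAP inference calls per clip, which is the speed/detail trade-off the profiles expose.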
To run a specific profile:

```bash
python timbre.py analyze samples/0_sample.wav --profile precise
python timbre.py batch ./samples --profile fast
python timbre.py validate --input ./out/precise/json --profile precise
```

To analyze and validate in one command:

```bash
python timbre.py analyze samples/0_sample.wav --profile precise --validate
python timbre.py batch ./samples --profile fast --validate

python timbre.py analyze samples/0_sample.wav --validate \
    --validate-backend openai --validate-model gpt-5.4-mini
```

`config/vocabulary.yaml` defines all the labels CLAP classifies against: 13 categories, ~194 labels.
| Category | Example Labels |
|---|---|
| `impact` | metallic impact, glass shatter, gunshot, drum hit |
| `movement` | footsteps on gravel, door slam, paper rustling |
| `ambience` | outdoor ambience, crowd murmur, ocean waves |
| `weather` | heavy rain, thunder, wind howl |
| `machinery` | engine idle, electrical buzzing, drill |
| `vehicles` | car passing, motorcycle, aircraft flyover |
| `voices` | speech, laughter, crowd cheer |
| `water` | water dripping, waterfall, water splash |
| `animals` | bird chirping, dog bark, crickets |
| `textures` | low rumble, white noise, vinyl crackle |
| `music` | piano notes, guitar strum, rhythmic beat |
| `alerts` | alarm beep, siren, phone ringing |
| `background` | background noise, silence |
To add new labels, edit `vocabulary.yaml` and re-run. No retraining needed.
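Since classification is zero-shot, adding a label is purely a config change. The exact schema of `vocabulary.yaml` is not reproduced here; a category-to-labels mapping along these lines is the likely shape (hypothetical excerpt):

```yaml
# config/vocabulary.yaml: hypothetical excerpt; check the real file for the exact schema
impact:
  - metallic impact
  - glass shatter
  - wooden thud   # newly added label, picked up with no retraining
```

After editing, rebuild the label cache (the `vocab cache --force` subcommand shown earlier) so the cached CLAP text embeddings are recomputed rather than reused stale.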
| File | Description | Tags | Conf |
|---|---|---|---|
| `metal_clang.wav` | A sharp metallic impact followed by a short echo. | impact, metallic impact, sharp transient, reverb | 0.87 |
| `rain_heavy.wav` | Heavy rain on a hard surface, continuous and broadband. | weather, heavy rain, broadband noise, continuous | 0.91 |
| `footsteps_wood.wav` | Fast footsteps on a wooden floor, ending with a door slam. | movement, footsteps on wood, door slam, rhythmic | 0.82 |
| `engine_idle.wav` | A low-frequency mechanical engine hum, steady and continuous. | machinery, engine idle, mechanical hum, low frequency | 0.88 |
| `forest_wind.wav` | Soft wind ambience with distant birds and gentle rustling. | ambience, wind ambience, bird chirping, continuous | 0.79 |
| | macOS Silicon (MPS) | RunPod (CUDA) | CPU fallback |
|---|---|---|---|
| Setup | `bash setup_mac.sh` | `bash setup_runpod.sh` | `pip install -r requirements.txt` |
| Device | MPS (auto-detected) | CUDA (auto-detected) | CPU |
| fp16 | No (fp32 only) | Yes | No |
| VRAM / RAM | ~4 GB unified memory | ~4 GB VRAM | system RAM |
| Speed (10s clip) | ~5–15s | ~2–3s | ~30–60s |
| torch wheels | Standard pip | CUDA-specific index | Standard pip |
Other notes:

- `pyproject.toml` is the source of truth for Python dependencies; `requirements.txt` is release-generated from Poetry
- CLAP model size: ~1.2 GB (downloaded from HuggingFace Hub on first run, then cached)
- `torch >= 2.6.0` required (CVE-2025-32434: torch.load safety fix)
- On RunPod: `torchvision` must be upgraded together with `torch` in a single pip command to avoid internal import conflicts in transformers
| Phase | Focus | Key Addition |
|---|---|---|
| Phase 1 | Cataloging | CLAP + templates (this system) |
| Phase 2 | Richer descriptions | Audio LLM (Qwen-Audio, SALMONN) |
| Phase 3 | Search | CLAP embeddings → vector database |
| Phase 4 | Similarity | Nearest-neighbor audio retrieval |
| Phase 5 | Streaming | Real-time pipeline (WebSocket) |